In [13]:
import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "r") )
### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]
data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)
print len(labels), len(features)
Create a decision tree classifier (just use the default parameters), train it on all the data. Print out the accuracy. THIS IS AN OVERFIT TREE, DO NOT TRUST THIS NUMBER! Nonetheless, what's the accuracy?
In [14]:
from sklearn import tree
from time import time
clf = tree.DecisionTreeClassifier()

def submitAcc(features, labels):
    # accuracy of the (globally defined, already fitted) clf on the given data
    return clf.score(features, labels)
t0 = time()
clf.fit(features, labels)
print("done in %0.3fs" % (time() - t0))
In [15]:
pred = clf.predict(features)
print "Classifier with accurancy %.2f%%" % (submitAcc(features, labels))
Now you’ll add in training and testing, so that you get a trustworthy accuracy number. Use the train_test_split validation available in sklearn.cross_validation; hold out 30% of the data for testing and set the random_state parameter to 42 (random_state controls which points go into the training set and which are used for testing; setting it to 42 means we know exactly which events are in which set, and can check the results you get).
In [16]:
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(features, labels, test_size=0.30, random_state=42)
print len(X_train), len(y_train)
print len(X_test), len(y_test)
In [17]:
clf = tree.DecisionTreeClassifier()
t0 = time()
clf.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))
In [18]:
pred = clf.predict(X_test)
print "Classifier with accurancy %.2f%%" % (submitAcc(X_test, y_test))
In [27]:
numPoiInTestSet = len([p for p in y_test if p == 1.0])
print numPoiInTestSet
In [29]:
from __future__ import division
# accuracy you would get by always guessing "not POI" (the test set holds 29 points)
1.0 - numPoiInTestSet/29
Out[29]:
Aaaand the testing data brings us back down to earth after that 99% accuracy.
Accuracy is not a particularly good metric when the classes are skewed like this, or when one type of error matters more than the other; precision and recall evaluate the model's performance better.
As you may now see, having imbalanced classes like we have in the Enron dataset (many more non-POIs than POIs) introduces a special challenge: you can just guess the more common class label for every point (not a very insightful strategy) and still get pretty good accuracy!
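To see this concretely, here is a minimal sketch (not part of the mini-project) that uses sklearn's DummyClassifier to always guess the more common class:

In [ ]:
from sklearn.dummy import DummyClassifier
# baseline that always predicts the most frequent class in the training data (non-POI)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print "Majority-class baseline accuracy %.2f%%" % (baseline.score(X_test, y_test) * 100)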
Precision and recall can help illuminate your performance better. Use precision_score and recall_score, available in sklearn.metrics, to compute those quantities.
In [30]:
from sklearn.metrics import precision_score, recall_score, confusion_matrix
In [32]:
precision_score(y_test,clf.predict(X_test))
Out[32]:
Obviously this isn’t a very optimized machine learning strategy (we haven’t tried any algorithms besides the decision tree, or tuned any parameters, or done any feature selection), and now seeing the precision and recall should make that much more apparent than the accuracy did.
In [33]:
recall_score(y_test,clf.predict(X_test))
Out[33]:
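If you want both metrics (plus F1 and support) summarized per class in one place, classification_report from sklearn.metrics prints them together; this is just an optional convenience, not required for the mini-project:

In [ ]:
from sklearn.metrics import classification_report
# per-class precision, recall, F1 and support on the held-out test set
print classification_report(y_test, clf.predict(X_test))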
In [31]:
y_true = y_test
y_pred = clf.predict(X_test)
cM = confusion_matrix(y_true, y_pred)
print "{:>72}".format('Actual Class')
print "{:>20}{:>20}{:>20}{:>23}".format('Predicted', '', 'Positive', 'Negative')
print "{:>20}{:>20}{:>20.3f}{:>23.3f}".format('', 'Positive', cM[0][0], cM[0][1])
print "{:>20}{:>20}{:>20.3f}{:>23.3f}".format('', 'Negative', cM[1][0], cM[1][1])
In [ ]: